7 - Differential Gene Expression & Level 1 Annotation

Author

CDN team

Published

May 31, 2024

Introduction

In this notebook we are going to pick up from the clustering notebook and focus on how to annotate a single-cell dataset. We’re briefly going to touch on how to carry out differential expression between clusters but for a more in depth explanation visit our notebook from a previous workshop. Annotating single-cell data is the most laborious part of the analysis. It requires in depth knowledge of the cell types making up your biological system and multiple rounds of iterative clustering and subsetting to get to a fine grained annotation. Luckily, there are some tools, like label transfer, that can be used to help speed up this process. These tools rely on using a relevant and previously annotated reference to help automate this annotation process. These automated methods work great for coarse cell type labels but don’t perform quite as well when annotating fine grained cell states that may be not be found in our reference. Therefore, an approach combining both methodologies is usually used. A good resource going over this process is the paper by Luecken, M, et al..

Key Takeaways

  • To annotate our clusters we need to determine which genes are differentially expressed in each one.

  • Differentially expressed genes depend on which cell types we are comparing. The same cell type will have different differentially expressed genes if we change the other cell types in the dataset.

  • We can quantify these differentially expressed genes using effect size and discriminatory power metrics such as log2FC and AUC.

  • P values obtained from carrying out DGE analysis between clusters are inflated and should not be used.

  • Annotation is a laborious process that uses automated and manual approache in which we use automated methods to coarsley label cells and manual annotation to identify unique cell states.

  • Automated annotation requires relevant reference data that has been previously annotated. We will only be able to identify cell types that were anotated in the reference.

  • Manual annotation is based on literature knowledge and digging through previous annotation efforts or field-relevant papers describing cell types of interest using other methods - bulk RNAseq, FACS…

Libraries

### Make sure all the packages are installed
if (!requireNamespace("Seurat", quietly = TRUE))
    install.packages("Seurat")

if (!requireNamespace("tidyverse", quietly = TRUE))
    install.packages("tidyverse")

if (!requireNamespace("colorBlindness", quietly = TRUE))
    install.packages("colorBlindness")

if (!requireNamespace("RColorBrewer", quietly = TRUE))
    install.packages("RColorBrewer")

if (!requireNamespace("DT", quietly = TRUE))
    install.packages("DT")

if (!requireNamespace("ComplexHeatmap", quietly = TRUE))
    BiocManager::install("ComplexHeatmap")

if (!requireNamespace("openxlsx", quietly = TRUE))
    BiocManager::install("openxlsx")

if (!requireNamespace("presto", quietly = TRUE))
    devtools::install_github("immunogenomics/presto")

if (!requireNamespace("SeuratData", quietly = TRUE))
    devtools::install_github('satijalab/seurat-data')

### Load all the necessary libraries
library(Seurat)
library(tidyverse)
library(colorBlindness)
library(RColorBrewer)
library(DT)
library(ComplexHeatmap)
library(SeuratData)
library(openxlsx)

Load data

We’re going to be working with a dataset from the paper - Immunophenotyping of COVID-19 and influenza highlights the role of type I interferons in development of severe COVID-19 Download data from cellxgene portal.

# Download the data in data/ directory
# download.file(
#     url = "https://datasets.cellxgene.cziscience.com/d8e35450-de43-451a-9979-276eac688bce.rds",
#     destfile = "../data/workshop-data.rds",
#     method = "wget",
#     extra = "-r -p --random-wait")
# We can also use the CLI with the wget command below
# wget https://datasets.cellxgene.cziscience.com/d8e35450-de43-451a-9979-276eac688bce.rds

se <- readRDS("../data/Covid_Flu_Seurat_Object.rds")

Generate a color palette for our cell types

# https://www.datanovia.com/en/blog/easy-way-to-expand-color-palettes-in-r/
# nb.cols <- length(unique(se$Celltype))
# mycolors <- colorRampPalette(paletteMartin)(nb.cols)
pal <- paletteMartin
names(pal) <- sort(unique(se$Celltype))

Analysis

Quick processing

se <- se %>%
    NormalizeData(verbose = FALSE) %>%
    FindVariableFeatures(
        method = "vst",
        nfeatures = 3000,
        verbose = FALSE) %>%
    ScaleData(verbose = FALSE, features = VariableFeatures(.)) %>%
    RunPCA(verbose = FALSE)

ElbowPlot(se, ndims = 50)

Let’s run FindNeighbors and FindClusters to label our data:

se <- FindNeighbors(se, reduction = "pca") %>% 
    FindClusters(resolution = c(0.01, 0.05, 0.1, 0.25))
Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck

Number of nodes: 59572
Number of edges: 1894745

Running Louvain algorithm...
Maximum modularity in 10 random starts: 0.9956
Number of communities: 4
Elapsed time: 13 seconds
Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck

Number of nodes: 59572
Number of edges: 1894745

Running Louvain algorithm...
Maximum modularity in 10 random starts: 0.9829
Number of communities: 7
Elapsed time: 11 seconds
Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck

Number of nodes: 59572
Number of edges: 1894745

Running Louvain algorithm...
Maximum modularity in 10 random starts: 0.9721
Number of communities: 9
Elapsed time: 11 seconds
Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck

Number of nodes: 59572
Number of edges: 1894745

Running Louvain algorithm...
Maximum modularity in 10 random starts: 0.9532
Number of communities: 15
Elapsed time: 13 seconds

Let’s compute the UMAP

se <- se %>%
    RunUMAP(dims = 1:30, verbose = FALSE)

We can now take a look on the UMAP space how the clusters look

se$sample_id <- se$`Sample ID`
DimPlot(
    se,
    group.by = c(
        "RNA_snn_res.0.01", "RNA_snn_res.0.05",
        "RNA_snn_res.0.1", "RNA_snn_res.0.25"),
    label = TRUE)

dim_plt <- DimPlot(
    se,
    group.by = c("RNA_snn_res.0.05"),
    label = TRUE)

And the original cell type labels + the sample IDs

DimPlot(
    se,
    group.by = c("Celltype", "sample_id"),
    label = FALSE)

For the purpose of this tutorial we’re going to go forward with resolution 0.05!

DGE Wilcoxon

The different implementations Seurat incorporates provides in FindAllMarkers compare the gene expression between 2 groups of cells. This one vs all strategy is very quick and returns the avg_log2FC. This avg_log2FC is computed as detailed here & here. Since we’re working with normalized data the log2FC can be directly computed by subtracting the average expression between both groups - log_{2}(\frac{exp1}{exp2})=log_{2}(Avg\_exp1)-log_{2}(Avg\_exp2)

Idents(se) <- "RNA_snn_res.0.05"
mgs <- FindAllMarkers(
    se,
    test.use = "wilcox",
    slot = "data",
    only.pos = TRUE,
    logfc.threshold = 0.5,
    min.pct = 0.25)

Look at the results in a dynamic table:

DT::datatable(mgs, filter = "top")

Look at the results in a heatmap

top10 <- mgs %>%
    group_by(cluster) %>%
    dplyr::filter(avg_log2FC > 1) %>%
    slice_head(n = 10) %>%
    ungroup()

DoHeatmap(se, features = top10$gene) + NoLegend()

Save marker genes to a spreadsheet

mgs_ls <- lapply(unique(mgs$cluster), function(i) {
    mgs %>% dplyr::filter(cluster == i)
    })

# Set names which will be the sheet name
names(mgs_ls) <- unique(mgs$cluster)

# Save marker genes to spreadsheet
openxlsx::write.xlsx(mgs_ls, file = "../data/lvl1_mgs.xlsx")
P-value interpretation

More details can be obtained in OSCA.

P values obtained from DGE analysis are inflated and, therefore invalid in their interpretation. We can’t use p-values to reject the Null Hypothesis since we are carrying out data snooping. This means that we are dividing the clusters based on their gene expression, and then computing p-values for the genes that are differentially expressed, even though we know these clusters have different gene expression patterns since we clustered the data based on them being different.

A way to show this is by looking at how skewed the distributions of the p-values obtained is:

# Compute the p-values without he thresholds
mgs2 <- FindAllMarkers(
    se,
    test.use = "wilcox",
    only.pos = TRUE,
    logfc.threshold = 0,
    min.pct = 0,
    return.thresh = 1,
    max.cells.per.ident = 100 # use 100 cells per cell type for speed
    )

ggplot(mgs2, aes(x = p_val, fill = cluster, color = cluster)) +
    # geom_histogram(alpha = 0.3, position = "identity") +
    geom_density(alpha = 0.3) +
    theme_minimal()

ggplot(mgs2, aes(x = p_val, fill = cluster, color = cluster)) +
    geom_histogram(alpha = 0.3, position = "identity") +
    facet_wrap(~cluster, scales = "free") +
    theme_minimal()

Annotation

There are two main ways to annotate your single cell dataset which are usually used together to aid in the process:

  • Automatic cell type annotation: typicall requires a reference dataset representative of the samples we are trying to annotate. Some commonly used tools in R are SingleR which has very good documentation and Seurat’s reference mapping either through their Azimuth server or following their reference mapping vignette which performs remarkably well.

Azimuth

We can use Azimuth online app to manually use a reference dataset to annotate our dataset.

To do so we need to save the counts matrix as an RDS file.

se@assays$RNA$counts[1:15, 1:15]
15 x 15 sparse Matrix of class "dgCMatrix"
                                      
TSPAN6   . . . . . . . . . . . . . . .
TNMD     . . . . . . . . . . . . . . .
DPM1     . . . . . . . 1 . 1 . . 1 . .
SCYL3    . . . . . . . . . . . . . . .
C1orf112 . . . . . 1 . . . . . . . . .
FGR      . . 1 2 . . 3 . 1 1 1 . 2 . .
CFH      . . . . . . . . . . . . . . .
FUCA2    . . . . . . 1 . . . . . . . .
GCLC     . . . . 3 . 2 . . 1 . . . . .
NFYA     . . . . . . . . . . . . . . .
STPG1    . . . . . . . . . . . . . . .
NIPAL3   . 1 . . . . . . . . . . . . .
LAS1L    . . . . . . . . . . . . . . .
ENPP4    . . . . . . . 1 . . . 1 . . .
SEMA3F   . . . . . . . . . . . . . . .
saveRDS(
    # subsample our data for a quick example
    object = se[, sample(colnames(se), .3 * ncol(se))]@assays$RNA$counts,
    file = "../data/counts.rds")

Reference Mapping

We are going to download a reference human PBMC dataset using the SeuratData package. You can use whichever reference suits your dataset best.

SeuratData::AvailableData()
                                   Dataset Version                                                        Summary    species            system ncells                                                            tech seurat default.dataset disk.datasets                               other.datasets                                                                     notes Installed InstalledVersion
adiposeref.SeuratData           adiposeref   1.0.0                                     Azimuth Reference: adipose      human           adipose 160075                                          scRNA-seq and sNuc-seq   <NA>            <NA>          <NA>                                         <NA>                                                                      <NA>     FALSE             <NA>
bmcite.SeuratData                   bmcite   0.3.0                                          30k Bone Marrow Cells      human       bone marrow  30672                                                            <NA>  3.2.2            <NA>          <NA>                                         <NA>                                                                      <NA>     FALSE             <NA>
bonemarrowref.SeuratData     bonemarrowref   1.0.0                                  Azimuth Reference: bonemarrow      human        bonemarrow 297627                                                          10x v2   <NA>            <NA>          <NA>                                         <NA>                                                                      <NA>     FALSE             <NA>
cbmc.SeuratData                       cbmc   3.1.4                   scRNAseq and 13-antibody sequencing of CBMCs      human CBMC (cord blood)   8617                                                        CITE-seq  3.1.4             raw     processed                                         <NA>                                                                      <NA>     FALSE             <NA>
celegans.embryo.SeuratData celegans.embryo   0.1.0         6k C. elegans embryos from Packer and Zhu et al (2019) C. elegans            embryo   6188                                                            <NA>   <NA>             raw          <NA>                                         <NA>                                                                      <NA>     FALSE             <NA>
fetusref.SeuratData               fetusref   1.0.0                                       Azimuth Reference: fetus      human             fetus 377456                                                            <NA>   <NA>            <NA>          <NA>                                         <NA>                                                                      <NA>     FALSE             <NA>
hcabm40k.SeuratData               hcabm40k   3.0.0 40,000 Cells From the Human Cell Atlas ICA Bone Marrow Dataset      human       bone marrow  40000                                                          10x v2   <NA>             raw          <NA>                                         <NA>                                                                      <NA>     FALSE             <NA>
heartref.SeuratData               heartref   1.0.0                                       Azimuth Reference: heart      human             heart 656509                                          scRNA-seq and sNuc-seq   <NA>            <NA>          <NA>                                         <NA>                                                                      <NA>     FALSE             <NA>
humancortexref.SeuratData   humancortexref   1.0.0                                 Azimuth Reference: humancortex      human      motor cortex  76533                                                            <NA>   <NA>            <NA>          <NA>                                         <NA>                                                                      <NA>     FALSE             <NA>
ifnb.SeuratData                       ifnb   3.1.0                              IFNB-Stimulated and Control PBMCs      human              PBMC  13999                                                          10x v1   <NA>             raw          <NA>                                    processed                                                                      <NA>     FALSE             <NA>
kidneyref.SeuratData             kidneyref   1.0.2                                      Azimuth Reference: kidney      human            kidney  64693                                                       snRNA-seq   <NA>            <NA>          <NA>                                         <NA>                                                                      <NA>     FALSE             <NA>
lungref.SeuratData                 lungref   2.0.0                                        Azimuth Reference: lung      human              lung 584884                                                            <NA>   <NA>            <NA>          <NA>                                         <NA>                                                                      <NA>     FALSE             <NA>
mousecortexref.SeuratData   mousecortexref   1.0.0                                 Azimuth Reference: mousecortex      mouse      motor cortex 159738                                                          10x v3   <NA>            <NA>          <NA>                                         <NA>                                                                      <NA>     FALSE             <NA>
panc8.SeuratData                     panc8   3.0.2               Eight Pancreas Datasets Across Five Technologies      human Pancreatic Islets  14892                SMARTSeq2, Fluidigm C1, CelSeq, CelSeq2, inDrops   <NA>             raw          <NA>                                         <NA>                                                                      <NA>      TRUE            3.0.2
pancreasref.SeuratData         pancreasref   1.0.0                                    Azimuth Reference: pancreas      human          pancreas  35289                                                            <NA>   <NA>            <NA>          <NA>                                         <NA>                                                                      <NA>     FALSE             <NA>
pbmc3k.SeuratData                   pbmc3k   3.1.4                                     3k PBMCs from 10X Genomics      human              PBMC   2700                                                          10x v1  3.1.4             raw          <NA>                                 pbmc3k.final                                                                      <NA>      TRUE            3.1.4
pbmcMultiome.SeuratData       pbmcMultiome   0.1.4                             10X Genomics PBMC Multiome Dataset      human              pbmc  11909                                                            <NA>  4.1.1              NA          <NA>                          pbmc.rna, pbmc.atac                              One sample with two modalities, RNA and ATAC     FALSE             <NA>
pbmcref.SeuratData                 pbmcref   1.0.0                                        Azimuth Reference: pbmc      human              PBMC   2700                                                          10x v1   <NA>            <NA>          <NA>                                         <NA>                                                                      <NA>      TRUE            1.0.0
pbmcsca.SeuratData                 pbmcsca   3.0.0           Broad Institute PBMC Systematic Comparative Analysis      human              PBMC  31021 10x v2, 10x v3, SMARTSeq2, Seq-Well, inDrops, Drop-seq, CelSeq2   <NA>             raw          <NA>                                         <NA>                                                             HCA benchmark      TRUE            3.0.0
ssHippo.SeuratData                 ssHippo   3.1.4                      Slide-seq v2 dataset of mouse hippocampus      mouse       hippocampus     NA                                                     slideseq v2   <NA>             raw          <NA>                                         <NA>                                                                      <NA>     FALSE             <NA>
stxBrain.SeuratData               stxBrain   0.1.1                        10X Genomics Visium Mouse Brain Dataset      mouse             brain  12167                                                          visium   <NA>              NA          <NA> posterior1, posterior2, anterior1, anterior2 One sample split across four datasets as paired anterior/posterior slices     FALSE             <NA>
stxKidney.SeuratData             stxKidney   0.1.0                       10X Genomics Visium Mouse Kidney Dataset      mouse            kidney   1438                                                          visium   <NA>             raw          <NA>                                         <NA>                                                                      <NA>     FALSE             <NA>
thp1.eccite.SeuratData         thp1.eccite   3.1.5                                               ECCITE-seq THP-1      human              <NA>     NA                                                            <NA>   <NA>            <NA>          <NA>                                         <NA>                                                                      <NA>     FALSE             <NA>
tonsilref.SeuratData             tonsilref   2.0.0                                      Azimuth Reference: tonsil      human            tonsil 377963                                                       scRNA-seq   <NA>            <NA>          <NA>                                         <NA>                                                                      <NA>     FALSE             <NA>
# SeuratData::InstallData("pbmcsca", force.reinstall = TRUE)
SeuratData::InstalledData()
                   Dataset Version                                              Summary species            system ncells                                                            tech seurat default.dataset disk.datasets other.datasets         notes Installed InstalledVersion
panc8.SeuratData     panc8   3.0.2     Eight Pancreas Datasets Across Five Technologies   human Pancreatic Islets  14892                SMARTSeq2, Fluidigm C1, CelSeq, CelSeq2, inDrops   <NA>             raw          <NA>           <NA>          <NA>      TRUE            3.0.2
pbmc3k.SeuratData   pbmc3k   3.1.4                           3k PBMCs from 10X Genomics   human              PBMC   2700                                                          10x v1  3.1.4             raw          <NA>   pbmc3k.final          <NA>      TRUE            3.1.4
pbmcref.SeuratData pbmcref   1.0.0                              Azimuth Reference: pbmc   human              PBMC   2700                                                          10x v1   <NA>            <NA>          <NA>           <NA>          <NA>      TRUE            1.0.0
pbmcsca.SeuratData pbmcsca   3.0.0 Broad Institute PBMC Systematic Comparative Analysis   human              PBMC  31021 10x v2, 10x v3, SMARTSeq2, Seq-Well, inDrops, Drop-seq, CelSeq2   <NA>             raw          <NA>           <NA> HCA benchmark      TRUE            3.0.0
ref <- LoadData("pbmcsca")

Look at the data

table(ref$CellType)

                     B cell              CD14+ monocyte              CD16+ monocyte                 CD4+ T cell            Cytotoxic T cell              Dendritic cell               Megakaryocyte         Natural killer cell Plasmacytoid dendritic cell                  Unassigned 
                       5020                        5550                         804                        7391                        9071                         433                         977                        1565                         164                          46 

Preprocess data - Normalize, identify HVG and look at the Elbow plot. We start by normalizing the data just to make sure both datasets are processed the same way. We then use the union of the HVG to make sure we capture the biological variability of both datasets. Next we compute the UMAP, it is necessary to specify return.model=TRUE! Lastly, we look at the Elbow plot to assess how many dimensions to use in downstream analysis.

# Renormalize data and find HVG to make sure 
se <- NormalizeData(se) %>% FindVariableFeatures(method = "vst", nfeatures = 3000)
ref <- NormalizeData(ref) %>%
    FindVariableFeatures(method = "vst", nfeatures = 3000) %>%
    ScaleData() %>%
    RunPCA()

ElbowPlot(ref, ndims = 30)

ref <- ref %>% Seurat::RunUMAP(reduction = "pca", dims = 1:30, return.model = TRUE)

# Use the union of HVG
hvg <- union(VariableFeatures(se), VariableFeatures(ref))
length(hvg)
[1] 5464

Reference mapping - there is very good explanation in the documentation of ?FindTransferAnchors. Basically what we are doing here is embedding the cells into the same latent space. Anchors - defined as pairs of cells contained within each other’s k-neighborhood - are identified. Lastly, with TransferData the shared nearest neighbors overlap between the anchors and the cells is used to obtain the that cell’s predicted label.

# Find anchors in reference dataers
anchors <- FindTransferAnchors(
    reference = ref,
    query = se,
    dims = 1:30,
    reference.reduction = "pca",
    features = hvg,
    normalization.method = "LogNormalize")

# Extract the predictions
predictions <- TransferData(
    anchorset = anchors,
    refdata = ref$CellType,
    dims = 1:30)

head(predictions)
                           predicted.id prediction.score.Cytotoxic.T.cell prediction.score.CD4..T.cell prediction.score.CD14..monocyte prediction.score.B.cell prediction.score.Megakaryocyte prediction.score.CD16..monocyte prediction.score.Natural.killer.cell prediction.score.Dendritic.cell prediction.score.Plasmacytoid.dendritic.cell prediction.score.Unassigned prediction.score.max
AAACCCAAGAATGTTG-12 Natural killer cell                        0.21689575                            0                               0                       0                              0                               0                           0.78310425                               0                                            0                           0            0.7831042
AAACCCAAGCATTGTC-19         CD4+ T cell                        0.00000000                            1                               0                       0                              0                               0                           0.00000000                               0                                            0                           0            1.0000000
AAACCCAAGCTACGTT-6     Cytotoxic T cell                        0.98981523                            0                               0                       0                              0                               0                           0.01018477                               0                                            0                           0            0.9898152
AAACCCAAGGCCGCTT-3       CD14+ monocyte                        0.00000000                            0                               1                       0                              0                               0                           0.00000000                               0                                            0                           0            1.0000000
AAACCCAAGGCCTGCT-12 Natural killer cell                        0.08903623                            0                               0                       0                              0                               0                           0.91096377                               0                                            0                           0            0.9109638
AAACCCAAGGGCAATC-1               B cell                        0.00000000                            0                               0                       1                              0                               0                           0.00000000                               0                                            0                           0            1.0000000

Let’s add these predictions to our seurat object

colnames(predictions)
 [1] "predicted.id"                                 "prediction.score.Cytotoxic.T.cell"            "prediction.score.CD4..T.cell"                 "prediction.score.CD14..monocyte"              "prediction.score.B.cell"                      "prediction.score.Megakaryocyte"               "prediction.score.CD16..monocyte"              "prediction.score.Natural.killer.cell"         "prediction.score.Dendritic.cell"              "prediction.score.Plasmacytoid.dendritic.cell" "prediction.score.Unassigned"                  "prediction.score.max"                        
se <- AddMetaData(
    se,
    metadata = predictions[, c("predicted.id", "prediction.score.max", "prediction.score.Unassigned")])

Look at the label transfer

DimPlot(se, group.by = c("Celltype", "predicted.id"))

Since we also got a confidence value for each cell we can visualize it

FeaturePlot(se, features = c("prediction.score.max", "prediction.score.Unassigned"))

Manual Annotation

Cluster 0 & 4

Let’s look at genes that are differentially expressed

FeaturePlot(
    se,
    features = c("CD3D", "CD3D", "TRAC", "TRBC2", "CD8B", "CD4")) +
    dim_plt

VlnPlot(
    se,
    features = c("CD3D", "CD3D", "TRAC", "TRBC2", "CD8B", "CD4"),
    group.by = "RNA_snn_res.0.05") +
    dim_plt

Clusters 0 & 4 seem to have a lot of expression of T cell related genes so at this level 1 we are going to label them as T cells.

Cluster 1

Let’s look at genes that are differentially expressed

FeaturePlot(
    se,
    features = c("CD14", "S100A8", "VCAN", "LYZ", "MS4A6A")) +
    dim_plt

VlnPlot(
    se,
    features = c("CD14", "S100A8", "VCAN", "LYZ", "MS4A6A"),
    group.by = "RNA_snn_res.0.05") +
    dim_plt

Cluster 1 is expressing a lot of monocyte-like genes, at this level 1 we are going to label them as monocytes.

Cluster 2

Let’s look at genes that are differentially expressed

FeaturePlot(
    se,
    features = c("MS4A1", "CD79A", "CD79B", "IGHD", "IGHM")) +
    dim_plt

VlnPlot(
    se,
    features = c("MS4A1", "CD79A", "CD79B", "IGHD", "IGHM"),
    group.by = "RNA_snn_res.0.05") +
    dim_plt

Cluster 2 is expressing B cells genes

Cluster 3

Let’s look at genes that are differentially expressed

FeaturePlot(
    se,
    features = c("PF4", "GP9", "PPBP")) +
    dim_plt

VlnPlot(
    se,
    features = c("PF4", "GP9", "PPBP"),
    group.by = "RNA_snn_res.0.05") +
    dim_plt

Cluster 3 is expressing platelet genes

Cluster 5

Let’s look at genes that are differentially expressed

FeaturePlot(
    se,
    features = c("HBA1", "HBA2", "HBB", "HBD")) +
    dim_plt

VlnPlot(
    se,
    features = c("HBA1", "HBA2", "HBB", "HBD"),
    group.by = "RNA_snn_res.0.05") +
    dim_plt

Cluster 5 is expressing hemoglobin genes so they are likely RBC

Cluster 6

Let’s look at genes that are differentially expressed

FeaturePlot(
    se,
    features = c("SHD", "SHC", "LILRA4", "CLEC4C", "IL3RA", "IRF4")) +
    dim_plt

VlnPlot(
    se,
    features = c("SHD", "SHC", "LILRA4", "CLEC4C", "IL3RA", "IRF4"),
    group.by = "RNA_snn_res.0.05") +
    dim_plt

Cluster 6 is expressing genes predominantly expressed by pDCs.

Annotate

According to the markers observed we can make a first general annotation

se@meta.data <- se@meta.data %>%
  dplyr::mutate(
    annotation_lvl1 = dplyr::case_when(
      RNA_snn_res.0.05 == 0 ~ "T cells",
      RNA_snn_res.0.05 == 1 ~ "Monocytes", #
      RNA_snn_res.0.05 == 2 ~ "B cells",
      RNA_snn_res.0.05 == 3 ~ "Platelets",
      RNA_snn_res.0.05 == 4 ~ "T cells", # 
      RNA_snn_res.0.05 == 5 ~ "RBC",
      RNA_snn_res.0.05 == 6 ~ "pDCs")
  )

DimPlot(se, group.by = "annotation_lvl1")

Summary genes

We can visualize this as a dotplot

order <- c("T cells", "Monocytes", "B cells", "Platelets", "RBC", "pDCs")

se$annotation_lvl1_ord <- factor(
  x = se$annotation_lvl1,
  levels = order)

## Genes for DOTPLOT
dplot_genes <- c(
    # T cell genes
    "CD3D", "CD3E", "TRAC", "TRBC1", "TRBC2", "CD8A", "CD8B", "CD4",
    # Monocytes
    "CD14", "S100A8", "VCAN", "LYZ", "MS4A6A",
    # B cells
    "MS4A1", "CD79A", "CD79B", "IGHD", "IGHM",
    # Platelets
    "PF4", "GP9", "PPBP",
    # RBS
    "HBA1", "HBA2", "HBB", "HBD",
    #pDCs
    "LILRA4", "CLEC4C", "IL3RA", "IRF4"
  )

Seurat::DotPlot(
  object = se,
  features = dplot_genes,
  group.by = "annotation_lvl1_ord",
  col.min = 0,
  dot.min = 0) +
  ggplot2::scale_x_discrete(
    breaks = dplot_genes) +
  ggplot2::theme(axis.text.x = ggplot2::element_text(angle = 60, hjust = 1)) +
  ggplot2::labs(x = "", y = "")

We can also visualize this as a heatmap using Seurat’s DoHeatmap function:

Seurat::DoHeatmap(
    se,
    features = dplot_genes,
    group.by = "annotation_lvl1_ord")

Or with pheatmap

# Cell types in the order the gdotplot genes are set
ct_vec <- c("T cells", "Monocytes", "B cells", "Platelets", "RBC", "pDCs")

lvl1_pal <- c("purple", "orange", "forestgreen", "lightblue", "darkred", "black")
names(lvl1_pal) <- ct_vec
# Subset to 30% of the dataset
se_30 <- se[, sample(colnames(se), 0.5 * ncol(se))]

hm_ls <- lapply(ct_vec, function(i) {
    se_sub <- se_30[, se_30$annotation_lvl1 == i]
    # Extract Gene Expression Matrix from Seurat Object
    gene_expr <- GetAssayData(se_sub, assay = "RNA", slot = "scale.data")
    
    # Subset the genes intersecting between gene expression and genes of interest
    g_int <- dplot_genes[dplot_genes %in% rownames(gene_expr)]
    
    # Subset expression matrix to only genes of interest
    gene_expr <- gene_expr[g_int, ]
    
    # Add the score of the signature as annotation in the heatmap
    colAnn <- ComplexHeatmap::HeatmapAnnotation(
        df = se_sub@meta.data[, c("Celltype", "annotation_lvl1"), drop = FALSE],
        # name = "Celltype",
        which = 'column',
        col = list("Celltype" = pal, "annotation_lvl1" = lvl1_pal),
        show_annotation_name = FALSE)
    
    # Visualize the Heatmap with the genes and signature 
    ComplexHeatmap::Heatmap(
        as.matrix(gene_expr),
        name = "Scaled Gene Expression",
        # col = expr_cols,
        cluster_rows = FALSE,
        cluster_columns = TRUE,
        # column_title = sig_name,
        column_names_gp = gpar(fontsize = 14),
        show_column_names = FALSE,
        top_annotation = colAnn,
        )
})

# Return ComplexHeatmap
hm_ls[[1]] + hm_ls[[2]] + hm_ls[[3]] + hm_ls[[4]] + hm_ls[[5]] + hm_ls[[6]]

Annotation agreement

Lastly let’s see if our Manual annotation agrees with the reference annotation:

DimPlot(
    se,
    group.by = c("annotation_lvl1", "predicted.id", "Celltype"),
    ncol = 2)

We can also check the overlap between manual and automatic annotation as follows:

se@meta.data %>%
    # Count the instances each combination of annotations happen
    dplyr::count(annotation_lvl1, predicted.id) %>%
    ggplot(aes(x = annotation_lvl1, y = predicted.id, fill = n, color = n, label = n)) +
    geom_tile(color = "lightgrey") +
    geom_text(color = "lightgrey") +
    scale_fill_viridis_c() +
    theme_classic() +
    theme(legend.position = "none")

Check the overlap between manual and the author provided annotation:

se@meta.data %>%
    # Count the instances each combination of annotations happen
    dplyr::count(annotation_lvl1, Celltype) %>%
    ggplot(aes(x = annotation_lvl1, y = Celltype, fill = n, color = n, label = n)) +
    geom_tile(color = "lightgrey") +
    geom_text(color = "lightgrey") +
    scale_fill_viridis_c() +
    theme_classic() +
    theme(legend.position = "none")

Extra!!!

See below how the avg_log2FC calculation is done! Code extracted from Seurat’s codebase.

features <- rownames(se) == "MS4A1"
cells.1 <- se$Celltype == "B cell, IgG+"
cells.2 <- se$Celltype != "B cell, IgG+"
data.use <- GetAssayData(object = se, assay.type = "RNA", slot = "data")
pseudocount.use <- 1
base <- 2

# Calculate fold change
mean.fxn <- function(x) {
    return(log(x = (rowSums(x = expm1(x = x)) + pseudocount.use)/NCOL(x), base = base))
  }

data.1 <- mean.fxn(data.use[features, cells.1, drop = FALSE])
data.2 <- mean.fxn(data.use[features, cells.2, drop = FALSE])

# Look at log2FC
(fc <- (data.1 - data.2))

Check if its equal to the avg_log2FC obtained from FindAllMarkers:

fc == mgs[mgs$cluster == "B cell, IgG+" & mgs$gene == "MS4A1", "avg_log2FC"]

Session Info

sessionInfo()
R version 4.3.1 (2023-06-16)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS Ventura 13.4

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRblas.0.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.11.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: America/New_York
tzcode source: internal

attached base packages:
[1] grid      stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] openxlsx_4.2.5.2         pbmcsca.SeuratData_3.0.0 pbmcref.SeuratData_1.0.0 pbmc3k.SeuratData_3.1.4  panc8.SeuratData_3.0.2   SeuratData_0.2.2.9001    ComplexHeatmap_2.16.0    DT_0.33                  RColorBrewer_1.1-3       colorBlindness_0.1.9     lubridate_1.9.2          forcats_1.0.0            stringr_1.5.1            dplyr_1.1.4              purrr_1.0.2              readr_2.1.4              tidyr_1.3.1              tibble_3.2.1             ggplot2_3.5.0            tidyverse_2.0.0          Seurat_5.0.3             SeuratObject_5.0.2       sp_2.1-3                

loaded via a namespace (and not attached):
  [1] RcppAnnoy_0.0.22       splines_4.3.1          later_1.3.2            polyclip_1.10-6        fastDummies_1.7.3      lifecycle_1.0.4        doParallel_1.0.17      globals_0.16.2         lattice_0.21-8         MASS_7.3-60            crosstalk_1.2.1        magrittr_2.0.3         sass_0.4.8             limma_3.56.2           plotly_4.10.4          rmarkdown_2.26         jquerylib_0.1.4        yaml_2.3.8             httpuv_1.6.14          sctransform_0.4.1      spam_2.10-0            zip_2.3.0              spatstat.sparse_3.0-3  reticulate_1.35.0.9000 cowplot_1.1.3          pbapply_1.7-2          abind_1.4-5            Rtsne_0.17             presto_1.0.0           BiocGenerics_0.46.0    rappdirs_0.3.3         circlize_0.4.15        IRanges_2.34.1         S4Vectors_0.38.1       ggrepel_0.9.5          irlba_2.3.5.1          listenv_0.9.1          spatstat.utils_3.0-4   goftest_1.2-3          RSpectra_0.16-1        spatstat.random_3.2-2  fitdistrplus_1.1-11    parallelly_1.37.0      leiden_0.4.3.1         codetools_0.2-19       tidyselect_1.2.0       shape_1.4.6            farver_2.1.1           matrixStats_1.2.0      stats4_4.3.1           spatstat.explore_3.2-6
 [52] jsonlite_1.8.8         GetoptLong_1.0.5       ellipsis_0.3.2         progressr_0.14.0       ggridges_0.5.6         survival_3.5-7         iterators_1.0.14       foreach_1.5.2          tools_4.3.1            ica_1.0-3              Rcpp_1.0.12            glue_1.7.0             gridExtra_2.3          xfun_0.42              withr_3.0.0            fastmap_1.1.1          fansi_1.0.6            digest_0.6.34          timechange_0.2.0       R6_2.5.1               mime_0.12              gridGraphics_0.5-1     colorspace_2.1-0       Cairo_1.6-1            scattermore_1.2        tensor_1.5             spatstat.data_3.0-4    utf8_1.2.4             generics_0.1.3         data.table_1.15.0      httr_1.4.7             htmlwidgets_1.6.4      uwot_0.1.16            pkgconfig_2.0.3        gtable_0.3.4           lmtest_0.9-40          htmltools_0.5.7        dotCall64_1.1-1        clue_0.3-64            scales_1.3.0           png_0.1-8              knitr_1.45             rstudioapi_0.15.0      tzdb_0.4.0             reshape2_1.4.4         rjson_0.2.21           nlme_3.1-163           cachem_1.0.8           zoo_1.8-12             GlobalOptions_0.1.2    KernSmooth_2.23-22    
[103] vipor_0.4.5            parallel_4.3.1         miniUI_0.1.1.1         ggrastr_1.0.2          pillar_1.9.0           vctrs_0.6.5            RANN_2.6.1             promises_1.2.1         xtable_1.8-4           cluster_2.1.4          beeswarm_0.4.0         evaluate_0.23          magick_2.7.5           cli_3.6.2              compiler_4.3.1         rlang_1.1.3            crayon_1.5.2           future.apply_1.11.1    labeling_0.4.3         ggbeeswarm_0.7.2       plyr_1.8.9             stringi_1.8.3          viridisLite_0.4.2      deldir_2.0-2           munsell_0.5.0          lazyeval_0.2.2         spatstat.geom_3.2-8    Matrix_1.6-5           RcppHNSW_0.6.0         hms_1.1.3              patchwork_1.2.0        future_1.33.1          shiny_1.8.0            ROCR_1.0-11            igraph_2.0.2           bslib_0.6.1